Unknown Words Modelling in Training and Using Language Models for Russian LVCSR System
Authors
Abstract
The paper considers some peculiarities of training and using N-gram language models with open vocabulary. It is demonstrated that explicit modeling of the probability distribution of out-of-model (unknown) words is necessary in this case. Two known techniques for this modeling are considered and a new technique with several advantages is proposed. We present experiments which demonstrate the consistency of the proposed approach.
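To illustrate what explicit modeling of unknown words can look like, here is a minimal sketch of one common generic approach: pooling rare training words into a single unknown-word class. It is not taken from the paper, whose abstract does not name the techniques it considers. The token name `<unk>`, the functions `train_unigram_open_vocab` and `word_prob`, the `min_count` threshold, and the toy data are all assumptions made for illustration, and a unigram model stands in for a full N-gram model.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical token name for out-of-model (unknown) words


def train_unigram_open_vocab(tokens, min_count=2):
    """Estimate unigram probabilities, pooling rare words into the UNK class."""
    counts = Counter(tokens)
    # Words seen at least min_count times stay in the model vocabulary.
    vocab = {w for w, c in counts.items() if c >= min_count}
    # Every other word contributes its count to the UNK class.
    pooled = Counter(w if w in vocab else UNK for w in tokens)
    total = sum(pooled.values())
    return {w: c / total for w, c in pooled.items()}


def word_prob(model, word):
    """Look up a word, falling back to the UNK class if it is not in the model."""
    return model.get(word, model.get(UNK, 0.0))


# Toy training data (assumed): "c" and "d" occur once each and become UNK.
tokens = "a b a c a b d".split()
model = train_unigram_open_vocab(tokens, min_count=2)

for w in ["a", "b", "zzz"]:
    print(w, word_prob(model, w))
```

In this sketch the value returned for an unseen word is the probability mass of the whole unknown-word class, not of that particular word, so it is not directly comparable with the probabilities of in-vocabulary words.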
Similar resources
Dictionary of Abstract and Concrete Words of the Russian Language: A Methodology for Creation and Application
The paper describes the first stage of a project on creating an electronic dictionary with numerical estimates of the degree of abstractness and concreteness of Russian words. Our approach is to integrate data obtained from several different sources: text corpora, psycholinguistic experiments, published dictionaries, markers of abstractness (certain suffixes) and a translation of a similar dict...
Sub-word-based language models for speech recognition: implications for spoken document retrieval
Large Vocabulary Continuous Speech Recognition (LVCSR) is dependent on language models to constrain the acoustic search space by delivering an a priori probability of possible word sequences. A language model for LVCSR models a spoken document as a time series; it predicts language as a sequence of units drawn from a fixed alphabet. The classic LVCSR language model is an n-gram model that model...
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...
Rhyming Compounds as Elements of a Language Game (In Russian and English Languages)
The article studies composite rhyming compounds as a means of word-formation play. It explores the place of this category of words in the lexical system and the peculiarities of their use in Russian and English. The authors present compound words as a special lexical subgroup. Using specific journalistic material, the article reveals the peculiarities of compo...
Increasing the Effectiveness of Russian Language Teaching for Special Purposes (to the Problem of Integration of Language Training with Information Technology Courses)
The article addresses the problem of making language teaching for special purposes more effective for foreign students studying Russian at a technical university. Particular attention is paid to training foreign students to work with information using the latest computer technology. The conclusions of the work are based on the analysis of the results of ...